Example for Si active learning
This case walks through an active learning process for a silicon system. The case is located in pwact/example/si_pwmat/. It starts by constructing the initial training set using INIT_BULK. Then, the model is trained using the initial training set, and active learning sampling is performed at temperatures of 300K, 500K, and 900K using perturbed structures generated by INIT_BULK.
The DFT calculation software used in this case is PWMAT. Input files for VASP are provided in pwact/example/si_vasp, for CP2K in pwact/example/si_cp2k, and for DFTB in pwact/example/si_dftb.
Please note that the DFT settings provided in the case are only for testing the program execution process and do not guarantee calculation accuracy.
INIT_BULK
Launch Command
pwact init_bulk init_param.json resource.json
INIT_BULK Directory Structure
The directory structure of INIT_BULK is as follows. atom.config, POSCAR, resource.json, init_param.json, relax_etot.input, relax_etot1.input, aimd_etot1.input, and aimd_etot2.input are input files, collection is the result summary directory, and pwdata is the directory where the pre-training data is located.
collection directory
The init_config_0 directory contains the results of atom.config after relaxation, supercell generation, lattice scaling, perturbation, and AIMD.
relax.config is the structure file obtained by relaxing atom.config. super_cell.config is the structure file obtained by generating a supercell from relax.config. 0.9_scale.config and 0.95_scale.config are the structure files obtained by scaling the lattice of super_cell.config.
The 0.9_scale_pertub directory includes 30 structures obtained by perturbing the atomic positions in the 0.9_scale.config structure.
The pwdata directory contains the AIMD trajectories, run on the perturbed structures with aimd_etot1.input, extracted into the pwdata format. It includes train and valid subdirectories for the training and validation sets, respectively.
In the train (or valid) directory, atom_type.npy contains the atomic types in the structure, position.npy contains the positions of atoms, and energies.npy, forces.npy, ei.npy, and virials.npy contain the total energy, the atomic forces in three directions, the atomic energies, and the virial information of the structure, respectively. ei.npy and virials.npy are optional files and are only extracted if the trajectory includes atomic energies and virials.
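As a quick sanity check of a train or valid directory, the file layout described above can be verified with a small script. This is a minimal sketch, not part of pwact: the function name check_pwdata_split is hypothetical, and the split into required and optional files is an assumption taken from the description above (the other .npy files shown in the directory tree are not checked).

```python
from pathlib import Path

# Assumption: these four files are always written, following the text above.
REQUIRED_FILES = ("atom_type.npy", "position.npy", "energies.npy", "forces.npy")
# Only present when the trajectory includes atomic energies and virials.
OPTIONAL_FILES = ("ei.npy", "virials.npy")


def check_pwdata_split(split_dir):
    """Return (missing_required, present_optional) for a train/ or valid/ directory."""
    split_dir = Path(split_dir)
    missing = [name for name in REQUIRED_FILES if not (split_dir / name).is_file()]
    optional = [name for name in OPTIONAL_FILES if (split_dir / name).is_file()]
    return missing, optional
```

An empty first list means all expected files are present; the second list tells you whether atomic energies or virials were extracted.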
directory1
example/init_bulk
├──atom.config
├──POSCAR
├──resource.json
├──init_param.json
├──relax_etot.input
├──relax_etot1.input
├──aimd_etot1.input
├──aimd_etot2.input
├──pwdata
└──collection
├──init_config_0
│ ├──super_cell.config
│ ├──0.9_scale.config
│ ├──0.9_scale_pertub
│ │ ├──0.9_scale.config
│ │ ├──0_pertub.config
│ │ ├──1_pertub.config
│ │ ├──2_pertub.config
│ │ ...
│ │ └──30_pertub.config
│ ├──0.95_scale.config
│ ├──0.95_scale_pertub
│ │ ├──0.95_scale.config
│ │ ├──0_pertub.config
│ │ ├──1_pertub.config
│ │ ├──2_pertub.config
│ │ ...
│ │ └──30_pertub.config
│ ├──PWdata
│ │ ├──train
│ │ │ ├──atom_type.npy
│ │ │ ├──energies.npy
│ │ │ ├──image_type.npy
│ │ │ ├──position.npy
│ │ │ ├──ei.npy
│ │ │ ├──forces.npy
│ │ │ ├──lattice.npy
│ │ │ ├──virials.npy
│ │ ├──valid
│ │ │ ├──atom_type.npy
│ │ │ ├──energies.npy
│ │ │ ├──image_type.npy
│ │ │ ├──position.npy
│ │ │ ├──ei.npy
│ │ │ ├──forces.npy
│ │ │ ├──lattice.npy
│ │ │ ├──virials.npy
│ └──valid
├──init_config_1
...
0.95_scale_pertub 0.9_scale_pertub relaxed.config train
├──init_config_2
Active Learning
We perform active learning using pre-training data and perturbed structures from the INIT_BULK case at temperatures of 500K, 800K, and 1100K.
To start the process, after executing the init_bulk command, navigate to the pwact/example/si_pwmat/ directory and run the following command:
pwact run param.json resource.json
Active Learning Directory Structure
The directory structure of active learning is as follows.
param.json and resource.json are the input control files for active learning, and scf_etot.input is the input file for self-consistent field (SCF) calculations in active learning.
si.al is the record file for the active learning process, documenting the executed active learning steps.
iter_result.txt records, for each round of active learning, how many explored structures were judged accurate, selected for labeling, or classified as errors. The content is shown in the following example.
iter.0000 Total structures 404 accurate 122 rate 30.20% selected 187 rate 46.29% error 95 rate 23.51%
iter.0001 Total structures 404 accurate 334 rate 82.67% selected 70 rate 17.33% error 0 rate 0%
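To track convergence across iterations, lines in this format can be parsed into numbers. The following is a minimal sketch that assumes every line follows exactly the layout of the example above; the function name parse_iter_result_line is hypothetical and not part of pwact.

```python
import re

# Pattern modeled on the example lines above; the rate fields are matched but not kept.
LINE_RE = re.compile(
    r"^(?P<iter>iter\.\d+)\s+Total structures\s+(?P<total>\d+)"
    r"\s+accurate\s+(?P<accurate>\d+)\s+rate\s+[\d.]+%"
    r"\s+selected\s+(?P<selected>\d+)\s+rate\s+[\d.]+%"
    r"\s+error\s+(?P<error>\d+)\s+rate\s+[\d.]+%\s*$"
)


def parse_iter_result_line(line):
    """Parse one iter_result.txt line into a dict of counts."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized iter_result line: {line!r}")
    record = {"iter": match["iter"]}
    for key in ("total", "accurate", "selected", "error"):
        record[key] = int(match[key])
    return record
```

In the example above, the accurate, selected, and error counts of each line add up to the total, which makes a convenient consistency check after parsing.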
iter.0000 is the directory for the first iteration of active learning, iter.0001 is for the second iteration, and so on.
train, explore, and label are the directories for the training, exploration, and labeling tasks, respectively, in each iteration of active learning.
train directory
In the train directory, we adopt a committee query strategy with 4 models. There are 4 training tasks, where the 4 models have identical configurations except for the different initialization values of their network parameters.
0-train.job, 1-train.job, 2-train.job, and 3-train.job are the Slurm job scripts for executing the training tasks. After the training tasks are completed, four tag files (0-tag.train.success, 1-tag.train.success, 2-tag.train.success, 3-tag.train.success) are generated to indicate that the training tasks completed successfully. In addition, the four models are saved in the train.000, train.001, train.002, and train.003 directories.
Taking the train.000 directory as an example, train.json is the input file for the PWMLFF model training, and std_input.json summarizes the training parameter settings output by PWMLFF. The model_record directory is where the models are saved: dp_model.ckpt is the model file, epoch_train.dat contains the average training error at each epoch, and epoch_valid.dat contains the average error on the validation set after each epoch of training.
torch_script_module.pt is the model file obtained by compiling dp_model.ckpt with the jitscript tool. It is used as the force field for simulations in LAMMPS.
explore directory
The explore directory consists of two subdirectories: md and select.
The md directory is for exploration in active learning and includes the files (input files, trajectories) for molecular dynamics simulations using the PWMLFF force field at different temperatures, pressures, etc.
Subsequently, the simulated structures (trajectories) are filtered based on the upper and lower bounds set by the committee query method, and the filtered results are saved in the select subdirectory.
md subdirectory
The md subdirectory contains two levels of subdirectories, named md.***.sys.***/md.***.sys.***.t.***, for example md.000.sys.000/md.000.sys.000.t.000. In the directory name md.000.sys.000, md.000 refers to the first MD setting in md_jobs of param.json, and sys.000 is the structure corresponding to index 0 of sys_index. This directory includes two subdirectories, md.000.sys.000.t.000 and md.000.sys.000.t.001, which hold the molecular dynamics simulations at the temperatures corresponding to indices 0 and 1 of temps.
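The naming scheme above can be decoded programmatically, for example when collecting trajectories by MD setting or temperature. The sketch below is illustrative only: parse_md_dir_name is a hypothetical helper, and treating an optional .p.*** suffix as a pressure index is an assumption based on the directory names that appear in the tree listing, not something stated in the text.

```python
import re

# md.<md_jobs index>.sys.<sys_index index>[.t.<temps index>][.p.<assumed pressure index>]
NAME_RE = re.compile(
    r"^md\.(?P<md>\d+)\.sys\.(?P<sys>\d+)"
    r"(?:\.t\.(?P<t>\d+))?(?:\.p\.(?P<p>\d+))?$"
)


def parse_md_dir_name(name):
    """Split a directory name such as md.000.sys.000.t.001 into integer indices."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an md directory name: {name!r}")
    # Missing optional parts are reported as None.
    return {
        key: (int(value) if value is not None else None)
        for key, value in match.groupdict().items()
    }
```

For instance, parse_md_dir_name("md.000.sys.000.t.001") gives the MD setting index 0, the structure index 0, and the temperature index 1.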
select subdirectory
The .csv files contain three columns: force deviation (devi_force), structure index in the trajectory (config_index), and trajectory file path (file_path).
accurate.csv summarizes the structures with force deviations less than the set lower_model_deiv_f.
fail.csv contains the structures with force deviations greater than the set upper_model_deiv_f.
If the number of candidate structures (structures with force deviations between the set upper and lower bounds) exceeds the set maximum number of selected points max_select, then max_select structures are randomly selected from the candidate structures and saved in the candidate.csv file, and the rest are saved in the candidate_delete.csv file.
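The selection rule described above can be sketched in a few lines. This is an illustration of the logic only, not the pwact implementation: the function name split_by_deviation and the seed argument are hypothetical, and treating values exactly equal to a bound as candidates is an assumption.

```python
import random


def split_by_deviation(devi_forces, lower, upper, max_select, seed=None):
    """Classify structure indices by force deviation and cap the candidates."""
    accurate, candidate, fail = [], [], []
    for index, devi in enumerate(devi_forces):
        if devi < lower:        # below lower_model_deiv_f
            accurate.append(index)
        elif devi > upper:      # above upper_model_deiv_f
            fail.append(index)
        else:                   # between the bounds (inclusive here, by assumption)
            candidate.append(index)

    if len(candidate) > max_select:
        # Randomly keep max_select candidates; the rest go to candidate_delete.
        rng = random.Random(seed)
        selected = sorted(rng.sample(candidate, max_select))
        kept = set(selected)
        deleted = [index for index in candidate if index not in kept]
    else:
        selected, deleted = candidate, []

    return {
        "accurate": accurate,
        "candidate": selected,
        "candidate_delete": deleted,
        "fail": fail,
    }
```

The four keys of the returned dictionary mirror the four .csv files in the select subdirectory.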
select_summary.txt summarizes the number of structures in each category, as shown in the following example.
Total structures 1212 accurate 16 rate 1.32% selected 403 rate 33.25% error 793 rate 65.43%
Select by model deviation force:
Accurate configurations: 16, details in file accurate.csv
Candidate configurations: 403, randomly select 10, delete 393
Select details in file candidate.csv
Delete details in file candidate_delete.csv.
Error configurations: 793, details in file fail.csv
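To see which frames of each trajectory were selected, the .csv files in the select subdirectory can be grouped by trajectory path. The sketch below assumes each file starts with a header row using the column names devi_force, config_index, and file_path given above; that header layout is an assumption, and group_configs_by_trajectory is a hypothetical helper.

```python
import csv
from collections import defaultdict


def group_configs_by_trajectory(csv_file):
    """Map each trajectory path to the list of structure indices listed for it.

    csv_file is an open text file (or any iterable of lines) with a header row.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["file_path"]].append(int(row["config_index"]))
    return dict(grouped)
```

A typical call would open candidate.csv with newline="" and pass the file object to the function.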
label directory
The label directory includes the scf and result subdirectories. The scf directory contains the directories for performing self-consistent calculations on the selected points obtained from the explore directory.
After the self-consistent calculations, the structures along with their corresponding energies, forces, atomic energies, and virials are extracted and saved in the result subdirectory in pwdata format.
scf
The first- and second-level subdirectories under scf, named in the format md.*.sys.*/md.*.sys.*.t.*, have the same structure and naming convention as the md subdirectories. The second-level subdirectories contain the directories for the self-consistent field (SCF) calculations. Taking the scf/md.000.sys.000/md.000.sys.000.t.000/820-scf directory as an example, 820 refers to the structure at step 820 of the md.000.sys.000/md.000.sys.000.t.000 trajectory. In this directory, 820.config is the input structure for the SCF calculation, etot.input is the input control file for the SCF calculation, and REPORT and OUT.MLMD are output files from PWMAT. The OUT.MLMD file contains information such as atomic positions, energy, and forces after the SCF calculation.
result
The first-level subdirectories under result correspond to all second-level subdirectories of the scf directory.
Taking result/md.000.sys.000.t.000 as an example, it is the data directory where all SCF calculation results from scf/md.000.sys.000/md.000.sys.000.t.000 are extracted and stored in the pwdata format. It includes two subdirectories, train and valid, which have the same layout as in the pwdata directory of the init_bulk example.
directory2
example
├──param.json
├──resource.json
├──scf_etot.input
├──si.al
├──iter_result.txt
├──iter.0000
│ ├──train
│ │ ├──0-train.job
│ │ ├──1-train.job
│ │ ├──2-train.job
│ │ ├──3-train.job
│ │ ├──train.000
│ │ │ ├──train.json
│ │ │ ├──std_input.json
│ │ │ ├──model_record
│ │ │ │ ├──dp_model.ckpt
│ │ │ │ ├──epoch_train.dat
│ │ │ │ └──epoch_valid.dat
│ │ │ ├──torch_script_module.pt
│ │ │ └──tag.train.success
│ │ ├──train.001
│ │ │ └──...
│ │ ├──train.002
│ │ │ └──...
│ │ ├──train.003
│ │ │ └──...
│ │ ├──0-tag.train.success
│ │ ├──1-tag.train.success
│ │ ├──2-tag.train.success
│ │ └──3-tag.train.success
│ ├──explore
│ │ ├──md
│ │ │ ├──md.000.sys.000
│ │ │ ├──md.000.sys.001
│ │ │ ├──md.001.sys.000
│ │ │ └──md.001.sys.003
│ │ └──select
│ │ ├──accurate.csv
│ │ ├──candidate.csv
│ │ ├──candidate_delete.csv
│ │ ├──fail.csv
│ │ └──select_summary.txt
│ └──label
│ │ ├──scf
│ │ │ ├──md.000.sys.000
│ │ │ │ ├──md.000.sys.000.t.001
│ │ │ │ │ ├──820-scf
│ │ │ │ │ │ ├──820.config
│ │ │ │ │ │ ├──etot.input
│ │ │ │ │ │ ├──REPORT
│ │ │ │ │ │ └──OUT.MLMD
│ │ │ │ │ ├──200-scf
│ │ │ │ │ └──...
│ │ │ │ └──md.000.sys.000.t.000
│ │ │ │ └──...
│ │ │ ├──md.000.sys.001
│ │ │ ├──...
│ │ │ ├──md.001.sys.000
│ │ └──result
│ │ ├──md.000.sys.000.t.000
│ │ ├──md.000.sys.000.t.001
│ │ ├──...
│ │ └──md.001.sys.000.t.001.p.000
├──iter.0001
│ └──...
├──...